Skip to content

Use KvikIO's implementation of file-backed memory mapping #19164

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: branch-25.10
Choose a base branch
from

Conversation

kingcrimsontianyu
Copy link
Contributor

@kingcrimsontianyu kingcrimsontianyu commented Jun 13, 2025

Description

Background

Libcudf 25.04 and earlier use memory_mapped_source by default. For host read it uses memory mmaped I/O, and for device read it uses standard I/O.

Libcudf 25.06 uses standard I/O by default. For host read, memory_mapped_source still uses memory mmaped I/O, while the device read is no longer allowed (runtime exception if used). Parquet/ORC readers fall back to the host read + H2D copy to emulate device read for the mapped source.

This PR

Now that the file-backed memory mapping (C++) is supported by KvikIO (rapidsai/kvikio#740), this PR updates libcudf to reinvigorate memory_mapped_source using KvikIO's implementation. This re-enables device read and brings performance improvement (e.g. through parallel prefault).

Notes

memory_mapped_source is an implementation detail in datasource.cpp. Currently testing was conducted on C2C (arm) and PCIe (x86) systems by manually setting LIBCUDF_MMAP_ENABLED=ON and running the tests. Refer to rapidsai/kvikio#530 (comment) for benchmark results.

Dependency

This PR depends on KvikIO PR rapidsai/kvikio#781 to fix unit test errors.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link

copy-pr-bot bot commented Jun 13, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Jun 13, 2025
@kingcrimsontianyu kingcrimsontianyu added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed CMake CMake build issue labels Jun 13, 2025
@github-actions github-actions bot added the CMake CMake build issue label Jun 13, 2025
@kingcrimsontianyu
Copy link
Contributor Author

/ok to test 51ee3b2

Copy link

copy-pr-bot bot commented Jul 7, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@kingcrimsontianyu kingcrimsontianyu moved this to Burndown in libcudf Jul 16, 2025
@github-actions github-actions bot removed the CMake CMake build issue label Jul 23, 2025
@kingcrimsontianyu
Copy link
Contributor Author

/ok to test

Copy link

copy-pr-bot bot commented Jul 23, 2025

/ok to test

@kingcrimsontianyu, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@kingcrimsontianyu
Copy link
Contributor Author

/ok to test bccb387

@kingcrimsontianyu kingcrimsontianyu marked this pull request as ready for review July 23, 2025 20:59
@kingcrimsontianyu kingcrimsontianyu requested a review from a team as a code owner July 23, 2025 20:59
@kingcrimsontianyu kingcrimsontianyu changed the base branch from branch-25.08 to branch-25.10 July 24, 2025 15:09
rapids-bot bot pushed a commit to rapidsai/kvikio that referenced this pull request Jul 25, 2025
cuDF PR rapidsai/cudf#19164 currently has 4 failed unit tests when `LIBCUDF_MMAP_ENABLED=ON`:
```
28 - CSV_TEST (Failed)
29 - ORC_TEST (Failed)
32 - JSON_TEST (Failed)
40 - DATA_CHUNK_SOURCE_TEST (Failed)
```
The fix entails code changes on both the KvikIO and cuDF sides.
On the KvikIO side, the `MmapHandle::read()` and `MmapHandle::pread()` methods need to:
- Allow the read size to be 0
- Allow `offset` to be equal to `initial_map_offset` (when the read size is 0)

This PR makes this change. In addition, this PR adds more detailed error messages when out-of-range exception occurs.

Authors:
  - Tianyu Liu (https://github.com/kingcrimsontianyu)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #781
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Status: Burndown
Development

Successfully merging this pull request may close these issues.

1 participant